A novel estimator for the two-way partial AUC

Background The two-way partial AUC has been recently proposed as a way to directly quantify partial area under the ROC curve with simultaneous restrictions on the sensitivity and specificity ranges of diagnostic tests or classifiers. The metric, as originally implemented in the tpAUC R package, is estimated using a nonparametric estimator based on a trimmed Mann-Whitney U-statistic, which becomes computationally expensive in large sample sizes. (Its computational complexity is of order \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(n_x n_y)$$\end{document}O(nxny), where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_x$$\end{document}nx and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n_y$$\end{document}ny represent the number of positive and negative cases, respectively). This is problematic since the statistical methodology for comparing estimates generated from alternative diagnostic tests/classifiers relies on bootstrapping resampling and requires repeated computations of the estimator on a large number of bootstrap samples. Methods By leveraging the graphical and probabilistic representations of the AUC, partial AUCs, and two-way partial AUC, we derive a novel estimator for the two-way partial AUC, which can be directly computed from the output of any software able to compute AUC and partial AUCs. We implemented our estimator using the computationally efficient pROC R package, which leverages a nonparametric approach using the trapezoidal rule for the computation of AUC and partial AUC scores. (Its computational complexity is of order \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(n \log n)$$\end{document}O(nlogn), where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n = n_x + n_y$$\end{document}n=nx+ny.). We compare the empirical bias and computation time of the proposed estimator against the original estimator provided in the tpAUC package in a series of simulation studies and on two real datasets. Results Our estimator tended to be less biased than the original estimator based on the trimmed Mann-Whitney U-statistic across all experiments (and showed considerably less bias in the experiments based on small sample sizes). But, most importantly, because the computational complexity of the proposed estimator is of order \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(n \log n)$$\end{document}O(nlogn), rather than \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$O(n_x n_y)$$\end{document}O(nxny), it is much faster to compute when sample sizes are large. Conclusions The proposed estimator provides an improvement for the computation of two-way partial AUC, and allows the comparison of diagnostic tests/machine learning classifiers in large datasets where repeated computations of the original estimator on bootstrap samples become too expensive to compute.


Motivation
Diagnostic tests based on continuous scales play a critical role in clinical practice, where at-risk individuals from a population are often screened for a disease using the results of quantitative laboratory tests [1].Similarly, computer aided diagnostic technologies based on machine learning [2] (ML) are increasingly becoming an useful tool to aid clinicians in their decision making process [3][4][5][6][7].However, before a new diagnostic test or ML based diagnostic classifier can be deployed in clinical practice it is essential to evaluate its accuracy and compare its performance against established methods.For diagnostic tests based on continuous measurements, a positive test result is obtained when the measurement is above a given threshold.Similarly, for ML based classifiers, a positive diagnosis is obtained when the ML algorithm prediction (usually reported as the probability that the individual has the disease) is above a given threshold.The receiver operation characteristic curve (ROC) [8] plots the false positive rate (FPR) against the the true positive rate (TPR) across all possible threshold levels and is a widely used technique for visualizing the trade-offs between FPR and TPR at any threshold level of interest in both diagnostic testing based on continuous scales [9][10][11][12][13] and machine learning evaluations [8].The area under the ROC curve (AUC) is the most widely used criterium to summarize the information provided by the ROC curve across all thresholds.
For every diagnostic test/classifier there is a costbenefit analysis necessary for evaluating the costs of false positives versus false negatives.These costs can sometimes be measured economically as in unnecessary medical care costs for false positives, but also translate to societal and individual costs such as psychological strain due to false positive diagnosis or the ethical concerns due to the direct harm caused by a false negative diagnosis (which might lead to poor prognosis or even the death of the individual).As a consequence, in many applications, it is important to simultaneously maintain a low FPR and high TPR, and regulatory or policy making bodies may have predefined criteria for the allowable TPR and FPR in various scenarios based on the healthcare economics of false positive and false negative outcomes.For example, the World Health Organization (WHO) has set minimal requirements for community-based tuberculosis screening at sensitivity (TPR) above 90% and specificity (1-FPR) above 70% [14].For SARS-CoV-2 diagnostics, the WHO requirements for antigendetecting rapid diagnostic tests are sensitivity above 80% and specificity above 97% [15].Thus, in evaluating diagnostic tests/classifiers, we may only be interested in comparing their performances within the regulatory guideline bounds.
When clear guidelines exist, there is an advantage to compare different classifiers in a limited region of the ROC space instead of over the entire range.

The current two-way partial AUC estimator, and it's limitations
The two-way partial AUC [16] (tpAUC) has been proposed as a way to directly quantify the area under the ROC curve satisfying explicit constraints in both the TPR and FPR ranges.Previous approaches in the literature [17][18][19][20][21] focused on quantifying partial area under the ROC (pAUC) by directly restricting the FPR range, but only indirectly controlling for the TPR.However, as illustrated in [16], this indirect approach can be problematic making diagnostic test comparisons less efficient and, in some circumstances, leading to incorrect conclusions (see the illustrative examples presented in [16] for further details).
Yang et al. [16] proposed a non-parametric estimator for the tpAUC based on a trimmed version of the Mann-Whitney U statistic [22] and implemented the method in the R [23] package tpAUC .Based on this estimator, the authors also described a bootstrap [24] based testing approach, based on a asymptotic confidence interval, to compare two tpAUCs.In practice, an important caveat of the tpAUC estimator proposed by [16] (which is this paper will be denoted as the "original estimator", tpAUC o ) is that it is computationally expensive to calculate for large datasets.(Its computational complexity is of order O(n x n y ) , where n x and n y represent the number of positive and negative cases in the dataset, respectively.)Furthermore, because the statistical approach for comparing two tpAUCs requires bootstrap resampling, the approach can quickly become computationally impractical as sample sizes increase.

Our contribution
To circumvent this problem, in this paper we propose a computationally efficient estimator of tpAUC, denoted as the "proposed estimator", tpAUC p , which can be directly computed from the output of any software able to compute the standard ("full") AUC and the standard partial AUC metrics [17], which only impose restrictions in either the FPR (specificity) or the TPR (sensitivity) ranges, but not on both simultaneously.We implemented our estimator using the computationally efficient pROC [25] R package, which implements a nonparametric approach based on trapezoids, rather than (trimmed) Mann-Whitney U statistic estimators, for the computation of AUC and partial AUC scores.We compare the computation time and the empirical bias of the original and proposed estimators in a series of simulation studies.Furthermore, we use two real datasets to illustrate the computational efficiency of the proposed estimator (as opposed to the original one) when performing the statistical comparison of two tpAUCs from different classifiers built on the same dataset (which requires bootstrapping).
Our empirical comparisons show that the proposed estimator tended to be less biased than the original one across all experiments (and showed considerably less bias in the experiments based on small sample sizes).But, most importantly, it can be orders of magnitude faster to compute than the original estimator when sample sizes are large.

Technical background
Before we present our proposed estimator, in the rest of this section we first review the mathematical definitions and key properties of the concepts of: (i) area under the curve (AUC); (ii) partial area under the curve (pAUC); and (iii) two-way partial area under the curve (tpAUC).We start by introducing some notation.

Notation
Throughout the text X represents a continuous test result for the positive cases, e.g., subjects that have a disease, while Y represents the test result for the negative cases, e.g., non-disease subjects.(In the context of machine learning applications, X and Y represent the confidence score, i.e., the predicted probability of the positive class, generated by a ML algorithm.)For any threshold c, a subject is classified as a positive case if its test result (confidence score) is larger than c.Let F(x) and G(y) represent, respectively, the cumulative distribution functions of the X and Y variables, while S F (x) and S G (y) represent the respective survival functions.Then the true positive rate (TPR) at a threshold c, denoted by TPR(c), is defined as, while the corresponding false positive rate (FPR) at the threshold c, denoted by FPR(c), is defined as, Note the TPR is also denoted as sensitivity (SENS) of the test/classifier, while the specificity (SPEC) corresponds to 1 − FPR , so that SENS(c) = TPR(c) and Throughout the text we let D k = {X i , Y j } , i = 1, . . ., n x , j = 1, . . ., n y , represent a dataset containing n x positive cases and n y negative cases, and we assume that the cases are statistically independent of each other.We let ŜF,n x and ŜG,n y represent, respectively, the empirical versions of the survival functions S F and S G , so that for any u ∈ [0, 1] we have that Ŝ−1 , where X (i) and Y (j) represent the associated order statistics, and ⌊N ⌋ represents the largest integer smaller than N.

The area under the ROC curve (AUC)
For any given threshold c, the costs and benefits of a diagnostic test (or classifier) are associated with a given FPR/ TPR (specificity/sensitivity) pair.The ROC curve plots the {FPR(c), TPR(c)} pairs for all possible threshold values (1) Fig. 1 ROC curves.Panels a and b show, respectively, the ROC curve defined in terms of FPR/TPR and specificity/sensitivity.In this illustrative example, the ROC curve was parameterized according to the binormal ROC model [18], where c (as illustrated in Fig. 1a) and provides a visual description of the trade-offs between sensitivity and specificity as c changes.(Equivalently, the ROC curve can also be defined in terms of specificity and sensitivity by plotting {SPEC(c), SENS(c)} for all possible threshold values c, as shown in Fig. 1b.)Mathematically, for any given threshold c we can describe the ROC curve as a function of the respective FPR value u = S G (c) as ROC(u) = S F (c) = S F S −1 G (u) , where S −1 G (.) represents the inverse function of the survival func- tion S G (.) .Figure 1a provides an illustrative example where the red arrow represents the FPR value (u) associated with the threshold c while the blue arrow shows the respective ROC value as a function of u. (Note that c represents a threshold with values ranging in the support of the X and Y variables and, contrary to u, it can assume values outside the [0, 1] range.) Often times, it is unclear how to choose a threshold c and it is desirable to adopt a summary measure that aggregates information across all possible threshold values.The AUC, defined as, summarizes information across all thresholds and is the most commonly used metric of accuracy in diagnostic tests (and is also widely adopted in machine learning applications).A well known statistical property of the AUC is that it corresponds to the probability that a classifier will rank a randomly chosen positive case higher than a randomly chosen negative one, that is AUC = P(X > Y ) [8,11].
Empirical (nonparametric) AUC estimates can be efficiently computed from empirical ROC curves using the sum of trapezoids [8] as implemented in the pROC [25] R package.(As described in [8], the computational complexity of the AUC estimation is of order O(n log n) , where n = n x + n y .)Importantly, it has been shown by Bam- ber [26] that, in the special case that the continuous test results (or confidence scores) have no ties, the AUC can also be estimated using the Mann-Whitney U-statistic [22], where n x and n y represents the number of positive and negative cases, respectively, and 11{A} represents an indi- cator function assuming value 1 if event A occurs and 0 otherwise.(This estimator, however, has computational complexity of order O(n x n y ).)

The partial area under the curve (pAUC)
In some applications, the interest does not lie in the entire range of FPR (or TPR) values, but in only a portion of the ROC curve.In these situations, we can use the partial area under the curve (pAUC) [17,18,27,28] to summarize the information across the portion of the ROC curve that are of interest.As described in [17], the pAUC can be defined both in terms of restricting the FPR (or specificity) ranges or the TPR/sensitivity ranges.When focusing on the FPR range [u 0 , u 1 ] the pAUC is defined as [17], Note that the partial AUC focusing on FPR can be re-expressed in terms of specificity as, When focusing on sensitivity, the pAUC focusing on the TPR range [u ′ 0 , u ′ 1 ] is given by [17], Similarly to the full AUC, empirical (nonparametric) partial AUC estimates can be efficiently computed from empirical ROC curves.In the pROC R package, partial AUCs are computed by ignoring trapezoids outside the partial range and adding partial trapezoids with linear interpolation when necessary [25].Furthermore, as described in [17], the partial AUCs can also be estimated nonparametrically using trimmed versions of the Mann-Whitney U-statistic as, (5)

The two-way partial area under the curve (tpAUC)
In practice, we are often interested in areas of the ROC space where a diagnostic test performs above pre-specified sensitive and specificity thresholds or, equivalently, above a pre-specified TPR and below a pre-specified FPR.
The tpAUC, first proposed in [16], represents an intuitive performance measure which captures partial area under the ROC with explicit restrictions on both TPR and FPR.
For instance, suppose that we are only interested in sensitivity (TPR) values above the threshold b se , and specificity values above the threshold b sp (or, equivalently, on FPR values below the threshold b fpr = 1 − b sp ), so that we are only interested in the area in the ROC space confined to the red rectangle in Fig. 2.Then, the tpAUC as defined in [16] corresponds to the area under the ROC curve (Area A) inside the red rectangle.From Fig. 2a, we see that, graphically, the tpAUC (Area A) corresponds to the pAUC fpr (u 0 , u 1 ) (Area A + Area B) minus the Area B, where u 1 = 1 − b sp = b fpr corresponds to the upper bound on the FPR and u 0 = c fpr = S G S −1 F (b se ) corre- sponds to the FPR value associated with the sensitivity bound b se .Hence, from Eq. ( 5), we have that the tpAUC can be expressed mathematically as, (11) As described in [16], the tpAUC can also be interpreted probabilistically as, and can be estimated nonparametrically using the trimmed Mann-Whitney U statistic estimator, implemented in the tpAUC R package.
Furthermore, reference [16] also describes a bootstrap assisted approach for computing asymptotic confidence intervals for the difference between two tpAUCs (so that, different diagnostic tests/classifiers constructed in the same dataset can be compared statistically).( 12)

Methods
In this section we describe the proposed tpAUC estimator (as well as, an alternative estimator that can be marginally faster to compute).

The proposed tpAUC estimator ( tpAUC p )
Here, we describe a novel (and straight forward) estimator of tpAUC that can be directly computed from the output of any software that can estimate pAUC sp , pAUC se and the total AUC .For its derivation, consider the graphical representations (and respective probabilistic interpretations) of the AUC , pAUC sp , pAUC se , and tpAUC quantities presented in Fig. 3.
Figure 3a shows the graphical representation of pAUC sp restricted to the specificity interval [1, b sp ] (blue area), while Fig. 3b shows the graphical representation of pAUC se restricted to the sensitivity interval [b se , 1] (red area).(To simplify the exposition we drop the integration intervals from the notation and represent pAUC sp (1, b sp ) by pAUC sp and pAUC se (b se , 1) by pAUC se ). Figure 3c shows the representation of both quantities, simultaneously, and clearly shows that the tpAUC (colored area in Fig. 3d) corresponds to the intersection of the pAUC sp (blue area) and pAUC se (red area) depicted in Area I.
From Fig. 3c, we also have that, so that it is easy to see that, Hence, we see that tpAUC can be computed according to Eq. ( 19) whenever the intersection of the pAUC sp and pAUC se areas is not empty, but will be zero when- ever the pAUC sp and pAUC se areas do not intersect or, equivalently, if the ROC curve does not pass through the area of interest in the ROC space defined by the upper left corner rectangle corresponding to sensitivity values greater or equal to b se and specificity values greater or equal to b sp (e.g., the red rectangle in Fig. 3d).As described in Fig. 4a, this condition can be tested by simply checking whether the specificity value corresponding to the sensitivity threshold b se , defined mathematically as, , is smaller than the specificity threshold b sp .Therefore, an alternative expression for the tpAUC is, Observe that Eq. ( 20) has the following probabilistic interpretation.Let E 1 and E 2 represent the following events, (19)   then, if events E 1 and E 2 are not mutually exclusive, it fol- lows from the definition of the probability of the intersection of two events that, where it is easy to see that   21) shows the equivalent condition in terms of sensitivity values, where we see that the ROC curve does not cross the region of interest if the sensitivity at the b sp specificity, c se , is smaller than b se .When these conditions hold, we have that the tpAUC is 0. In our implementation, we check the condition in panel a

An alternative tpAUC estimator ( tpAUC a )
From Fig. 2, it is easy to see that an alternative way to compute the tpAUC (area A) using the output of any software that can estimate pAUC fpr and also c fpr (i.e., the FPR value that corresponds to the sensitivity threshold b se ) is to directly estimate the tpAUC by subtracting the area B (the rectangle given by (b fpr − c fpr )b se ) from the pAUC fpr (c fpr , b fpr ) which corresponds to the sum of areas A and B. In other words, we can directly estimate tpAUC using the alternative estimator, or alternatively in terms of specificity as, where ĉsp = 1 − ĉfpr .Since this alternative estimator does not require the computation of the AUC and pAUC se its computation can be potentially faster than of tpAUC p .

Results
In this section we present four sets of experiments comparing the tpAUC estimators described in this paper.We start by presenting some initial comparisons (based on simulated data) between the original and proposed estimators, followed by more systematic comparisons of the computation time requirements of the original, proposed, and alternative estimators, and a more systematic comparison of the bias of the original and proposed estimators.In our final set of experiments, we use two real datasets to illustrate the computational complexity of the original estimator when performing the statistical comparison of two tpAUCs from different classifiers built on the same dataset.

Initial comparisons
We initially compared the original estimator based on the trimmed Mann-Whitney U statistic estimator, tpAUC o , presented in Eq. ( 14) against the proposed estimator, tpAUC p , presented in Eq. ( 24) in 9 initial experiments encompassing all the combinations of sample sizes set to 100, 1,000 or 10,000 cases and the proportion of negative and positive cases set to (90%, 10%), (50%, 50%), and (10%, 90%), as described in Table 1.
Each simulation experiment was composed of 1,000 replications where the data from each replication was generated from the binormal ROC model [18] using a different set of parameters randomly sampled from the distributions, µ y ∼ U (0, 2) , µ x ∼ µ y + U (0.1, 1.1) , σ y ∼ U (0.5, 1.5) , and σ x ∼ U (0.5, 1.5) , and the sensitivity and specificity thresholds where randomly selected from the distributions b se ∼ U (0.2, 0.8) and b sp ∼ U (0.2, 0.8) .The tpAUC estimates and computation time associated with these initial experiments are reported in Figs. 5 and 6.
Figure 5 presents scatterplots of the tpAUC estimates obtained with the original and proposed estimators.While the estimates tend to be very close for larger sample sizes (see Fig. 5d, e, and f, and, especially, Fig. 5g, h, i), we observed a fair amount of discrepancy for the estimates based on small sample sizes (Fig. 5a, b, and c) and, in particular, for unbalanced proportions of negative and positive cases (Fig. 5a and c).These initial experiments suggest that either one or both of these estimators tended to be fairly biased for small sample sizes.Additional simulation experiments conclude that it is actually the original estimator that shows higher bias than the proposed estimator in this case (see the "Bias comparison" subsection below).
Figure 6 presents the computation times for the original (Fig. 6a) and proposed (Fig. 6b) estimators.(Computation times correspond to the "user" time reported by the system.timefunction from R base.)Each panel reports boxplots for all 9 experiments grouped by their sample sizes ( n = 100 for experiments 1, 2, and 3, n = 1, 000 for experiments 4, 5, and 6, and n = 10, 000 for experiments 7, 8, and 9) and also color coded according to the proportion of negative and positive cases (orange boxplots for experiments with 90%/10% negative/positive cases, purple boxplots for experiments with 50%/50% negative/positive cases, and black boxplots for experiments with 10%/90% negative/positive cases).All experiments reported in this section (and throughout   Figure 6a shows a very sharp increase in computation time for the original estimator when the sample size increases to 10,000.Interestingly, we observed that, within each sample size group, the original estimator tended to spent considerably more time to compute the results of the experiments based on a balanced proportion of negative and positive cases (purple boxplots) relative to the experiments with unbalanced proportions (orange and black boxplots).(E.g., for n = 10, 000 , note how the original estimator took much longer to compute the results from experiment 8 compared to experiments 7 and 9.) On the other hand, Fig. 6b shows that the computation times for the proposed estimator show only a slight increase in computation time with increasing sample sizes, and only take a fraction of a second to compute the tpAUC even for n = 10, 000 (note the different scales in the y-axis of Fig. 6a and b).Also, observe that the quantized computation times observed in Fig. 6b (at 0.00, 0.01, 0.02, 0.03, 0.04, and 0.05 seconds) represent an artifact of the precision with which timing results are reported by the system.timefunction.Despite this coarse precision, the results still clearly illustrate that the computational complexity of the proposed estimator is considerably smaller than the original one.In the next subsection, we present some additional computation time experiments.

Additional computation time comparisons
To more systematically compare the computation times of original and proposed estimators we performed additional computation time experiments spanning a wider range of sample sizes.Additionally, we compare the proposed estimator (Eq.24) to the alternative estimator, described in Eq. ( 26), which can be potentially faster to compute.(For this alternative estimator, the reported computation time corresponds to the sum of the time taken for: estimating the ROC curve; estimating the specificity value corresponding to the sensitivity threshold; and estimating of the pAUC sp if necessary.Note, as well, that we only report the computation time comparison because the estimates generated by the alternative and proposed estimators are identical.)We performed two sets of experiments.In the first, we investigated the computation times over 24 simulation experiments across a grid of sample sizes ranging from 100 to 20,000.In these first experiments, we simulated data from the binormal ROC model [18] adopting µ x = 2 , σ x = 1.5 , µ y = 0 , σ y = 1 , b sp = 0.6 , and b se = 0.6 , with an equal proportion of positive and negative cases and each experiment was replicated 100 times.Figure 7 reports the results.Panel a compares the original (red) and proposed (blue) and shows a sharp increase in computation time for the original estimator (it can take over 1 minute for n = 20, 000 ) while the proposed estimator still takes only a fraction of a second at this sample size.Panel b compares the proposed (blue) and the alternative (green) estimators.The blue and green solid lines represent the average computation time across the 100 experiment replications and shows that the alternative estimator tended to be slightly faster on average.This difference, however, is very small.(Note the small scale in the y-axis.)for the proposed estimator.Note the different scales in the y-axis of panels a and b.The boxplots report the computation time across the same 1,000 replications of each one of the 9 experiments presented in Fig. 5 (and described in Table 1) In the second set of experiments, we investigated the computation times over sample sizes ranging from 10,000 to 100,000.The data was generated as in the first series of experiments, except that, due to the long computation times required by the original estimator, we only performed a single replication per sample size.Figure 8 presents the results and panel a shows that the original estimator can take over 25 minutes (1,500 seconds) to compute for n = 100, 000 , while panel b shows that the computation times of the proposed and alternative estimators are comparable and take only a fraction of the time required by the original estimator (with the alternative estimator tending to be only slightly faster).

Bias comparison
As illustrated in the initial experiments based on small sample sizes (Fig. 5 a, b, and c) the tpAUC estimates generated by the original and proposed estimators can differ substantially suggesting that one (or both) of them might be fairly biased for small sample sizes.In order to investigate this issue, we systematically compared the bias of the estimators in a series of simulation experiments over a range of sample sizes and proportions of negative and positive cases across 16 different combinations of sensitivity and specificity thresholds.But before we describe the simulation study design and results, we first describe how we estimated the bias from the tpAUC estimators.
By definition, the bias of an estimator θ is given by, where θ corresponds to the true parameter value.In order to compare the bias of the original and proposed tpAUC estimators, we conducted a Monte Carlo study where we simulate multiple datasets from a binormal ROC model [18] and compare the average of the empirical estimates against the true tpAUC value.
Under the binormal ROC model, the positive and negative cases are distributed, respectively, according to x and Y ∼ N µ y , σ 2 y so that the true positive and false positive rates are given respectively by, where �(.) represents the cumulative distribution func- tion of a standard normal random variable.
Re-expressing the general definition of the pAUC fpr in Eq. ( 12) in terms of the binormal ROC model we have that, where φ(.) represents the probability density function of a standard normal (27) variable, and the integration limits in the above equation follow from the fact that, Similarly, the Area B in Eq. ( 12) is re-expressed as, and the true tpAUC is obtained by solving the integral in Eq. ( 30) using numerical integration and computing the Area B in Eq. ( 31) analytically, and then subtracting the latter value from the former.We compared the bias of the original and proposed estimators in 6 simulation experiments with sample sizes set to either 100 or 1000 and with the proportion of negative and positive cases set to (90%, 10%), (50%, 50%), and (10%, 90%), as described in Table 2.

Comparing the predictions of distinct classifiers
When comparing distinct classifiers built on the same dataset, the difference between the performance metric is a popular criterium for selecting the best classifier.To this end, Yang et al. [16] describe how to compute asymptotic confidence intervals for the difference of tpAUCs statistic, where tpAUC C 1 and tpAUC C 2 represent the estimates obtained by the classifiers C 1 and C 2 , respectively.However, because the predictions of classifiers trained in the same dataset tend to be correlated, the statistics tpAUC C 1 and tpAUC C 2 are not independent and the asymptotic theory for the statistic tpAUC is non- standard.To circumvent this problem, Yang et al. [16] (and [17] in the context of the pAUC metric) described a bootstrap-assisted approach to calculate asymptotic (33) tpAUC = tpAUC C 1 − tpAUC C 2 , Fig. 9 Bias comparison Each panel presents boxplots reporting the absolute value of the estimated bias for the original (red) and proposed (blue) estimators (across 100 replications of each experiment described in Table 2) for 16 different combinations of sensitivity and specificity thresholds confidence intervals for the tpAUC statistic.Namely, the 100(1 − α) % confidence interval is given by, where Z 1−α/2 corresponds to the 1 − α/2 quantile of a standard normal distribution and v 2 boot represents the nonparametric bootstrap [24] estimate of the variance of tpAUC, and B represents the number of bootstrap samples.Clearly, because the bootstrap estimate v 2 boot requires the estimation of the tpAUC metric B times for each classifier C 1 and C 2 , we have that this procedure becomes computationally impractical for larger datasets when we use the original trimmed Mann-Whitney U statistic estimator in Eq. ( 14).
We illustrate this point using the Diabetic Retinopathy [29] and the Sepsis Survival [30] UCI datasets [31].The Diabetic Retinopathy is a moderate size dataset ( n = 1, 151 ) containing features extracted from the Messidor image set [32], including quality assessment, pre-screening, lesion-specific and anatomical components.The classification task is to predict the presence or absence of diabetic retinopathy.The Sepsis Survival is a large dataset containing 110,341 sepsis hospitalization records on 84,811 subjects.Each record contains (34 the age, gender, and the sepsis episode number (since some individuals had multiple episodes of sepsis).To make sure that all data points were independent we only used the first record of sepsis of each individual in our analyses (so that n = 84, 811 ).The classification task was to predict if a subject died or survived based on their age and gender.
For each of these datasets we trained a logistic regression [33] and a random forest [34] classifier (using 50/50 training/test data split) and then compute 95% confidence intervals for the tpAUC statistic, using the original and the proposed estimators of tpAUC.(We adopted the default tuning parameter values implemented in the ranger [35] R package for the random forest classification.)For both datasets we adopt sensitivity and specificity thresholds set to 0.4 and Fig. 10 presents the ROC curves based on the predicted probabilities from the logistic regression and random forest classifiers.Table 3 reports the respective confidence intervals and the time spent in their calculation (alongside the estimates of tpAUC using the logistic regression and random forest classifiers for both the original and proposed estimators).The reported results are based on B = 1, 000 sequentially run bootstraps for the estimation of v 2 boot .Comparison of the estimates based on the original and proposed estimators show that the results are fairly close, and indicate that the logistic regression classifier is statistically better than the random forest classifier, since for both datasets the 95% confidence intervals do not contain (36) tpAUC = tpAUC logistic regression − tpAUC random forest , Fig. 10 Real data illustrations.Panel a compares the ROC curves generated from the predictions of a logistic regression and a random forest classifier on the Diabetic Retinopathy Debrecen dataset, while panel b presents the analogous comparison on the Sepsis Survival dataset.In both panels the red square captures the area of interest of the ROC space obtained by setting the sensitivity and specificity thresholds to b se = 0.4 and b sp = 0.4 , respectively 0. (Note that for the Sepsis Survival dataset, while the difference in the tpAUC estimates is very small, the difference is still statistically significant since the very large sample size leads to a very narrow confidence interval.) Despite the fact that the results based on the original and proposed estimators are quite similar, the computation times are drastically different.While the computation of the confidence interval based on the original estimator takes less than 6 minutes for the Diabetic Retinopathy dataset ( n = 575 ), it takes over 5.12 days for the Sepsis Survival dataset ( n = 42, 405 ).On the other hand, the computation time based on the proposed estimator increases from 3.24 seconds in the Diabetic Retinopathy dataset to only 46.69 seconds in the Sepsis Survival dataset.

Discussion
It is well understood that the nonparametric estimators of AUC and pAUC tend to be systematically biased in small sample sizes, especially when the ROC operating points are not well spread out along the ROC curve [36].However, as pointed in the literature [17,36,37] the amount of bias becomes negligible as the sample size increases.Our bias comparison experiments (Fig. 9) also show a drastic reduction in bias when sample size increases from 100 to 1,000 for both the original and proposed estimators suggesting that this observation also holds true for the tpAUC.The fact that the proposed estimator tended to show smaller amounts of bias in comparison to the original estimator might be due to the fact that the pROC R package performs linear interpolation for adding partial trapezoids (when necessary) during the calculation of the partial AUCs.
But most importantly, our experiments and real data illustrations show that the computation of the proposed estimator can be orders of magnitude faster than the original one.This result is not surprising given that the calculation of the proposed estimator leverages the computationally efficient routines implemented in the pROC R package (see the pROC package documentation for details).Note that the computational complexity of the proposed estimator in Eq. ( 24) is log-linear since its computation depends on the estimation of the ROC curve and of the AUC and partial AUCs scores (and all of these computations have O(n log n) complexity).On the other hand, the original estimator based on the trimmed Mann-Whitney U statistic in Eq. ( 14), has computational complexity of order O(n y n x ) as it relies on a double sum- mation running across all (i, j) pairs for i = 1, . . ., n x and j = 1, . . ., n y .Note that this O(n y n x ) complexity is most expensive for balanced populations where n x = n y as cor- roborated in our experiments (Fig. 6a), where the results from the experiments based on n x = n y took more time to compute.
We also compare the proposed estimator ( tpAUC p , in Eq. 24) against the alternative estimator ( tpAUC a , in Eq. 26), which is simply a direct implementation of the mathematical definition of tpAUC in Eq. 12 (originally proposed by [16]) using the pROC package.Our comparisons (Figs.7b and 8b) show that while the alternative estimator can be slightly faster than the proposed one, the small differences are unimportant in practice.(Note that the alternative estimator also has a O(n log n) com- putational complexity.) The R code used to implement the proposed estimator and reproduce all the experiments presented in this paper is available in the github repository, https:// github.com/ echai bub/ pROC_ based_ tpAUC.Additionally, the github repository https:// github.com/ Sage-Bione tworks/ tp_ AUC provides scripts to run the R code in the Python environment (using the rpy2 Python library 1 ) and makes the proposed estimator available to the Python users community as well.

Conclusions
In summary, as clearly demonstrated by our synthetic and real data illustrations, our proposed tpAUC estimator represents a computationally efficient alternative to the original estimator based on trimmed Mann-Whitney U statistic for moderate to large datasets, which also tends to be less biased in small datasets.But most importantly, the proposed estimator makes the calculation of bootstrap-based confidence intervals feasible, and opens the doors for the comparison of diagnostic tests/machine learning classifiers in large datasets where the serial computation of the original estimator is impractical.

Fig. 2
Fig. 2 Graphical representation of the two-way partial AUC.In both panels, Area A represents the tpAUC.Panel a shows the ROC curve defined in terms of FPR and TPR (sensitivity) where the sum of Areas A and B represents the partial AUC focusing on the FPR, and u 0 = c fpr = S G S −1 F (b se ) corresponds to the FPR value at the sensitivity threshold of b se and u 1 = b fpr = 1 − b sp .Panel b shows the ROC curve defined in terms of specificity and sensitivity where the sum of Areas A and B represents the partial AUC focusing on specificity, and c sp = 1 − S G S −1 F (b se ) corresponds to the specificity value at the sensitivity threshold of b se

( 15 )Fig. 3
Fig. 3 Graphical representation of the partial AUCs and the two-way partial AUC. a Partial AUC focusing on specificity ( pAUC sp ), showing the area restricted to specificity values greater or equal to 0.6 (i.e., b sp = 0.6 ).b Partial AUC focusing on sensitivity ( pAUC se ), showing the area restricted to sensitivity values greater or equal to 0.75 (i.e., b se = 0.75 ).c Simultaneous representation of both pAUC sp and pAUC se .d Two-way partial AUC (tpAUC ) as the intersection of pAUC sp and pAUC se tpAUC = pAUC sp + pAUC se − (AUC − b se b sp ) = (Area I + Area II) + (Area I + Area III)− − (Area I + Area II + Area III + Area IV − Area IV) = Area I .
c sp < b sp pAUC se + pAUC sp − (AUC − b se b sp ) , if c sp ≥ b sp .
graphically, to the colored area (i.e., Area I + Area II + Area III) in Fig.3cthat can also be reexpressed as (Area I + Area II + Area III + Area IV) -Area IV = AUC − b se b sp .On the other hand, if E 1 and E 2 are mutually exclusive thanP(E 1 ∩ E 2 ) = tpAUC = 0.The expression for tpAUC in Eq. (20) immediately suggests the following estimator, where ĉsp = 1 − ŜG,n y ( Ŝ−1 F ,n x (b se )) corresponds to the estimated specificity at the sensitivity bound b se (in our implementation we use the coords function from the pROC R package to compute this value).

Fig. 4
Fig.4 Conditions under which tpAUC = 0 .Panel a shows that because the ROC curve is monotonic it follows that the curve will not cross the region of interest (red rectangle) if the specificity at the b se sensitivity, denoted by c sp and highlighted by the blue arrow, is smaller than b sp .Panel b shows the equivalent condition in terms of sensitivity values, where we see that the ROC curve does not cross the region of interest if the sensitivity at the b sp specificity, c se , is smaller than b se .When these conditions hold, we have that the tpAUC is 0. In our implementation, we check the condition in panel a ĉfpr > b fpr pAUC fpr ĉfpr , b fpr − b fpr − ĉfpr b se , if ĉfpr ≤ b fpr ,

Fig. 5
Fig.5 Comparison the tpAUC estimates.Panels a to i present scatterplots of the tpAUC estimates based on the original (x-axis) and proposed (y-axis) estimators across the nine initial experiments described in Table1.The estimates tend to get very close (bottom panels) as sample size increases.Results from each experiment were based on 1,000 replications (see main text for details)

Fig. 6
Fig. 6 Computation time comparison.Panel a shows boxplots of the computation time for the original estimator for all 9 experiments grouped by their sample sizes and color coded according to the proportion of negative and positive cases.Panel b shows the analogous resultsfor the proposed estimator.Note the different scales in the y-axis of panels a and b.The boxplots report the computation time across the same 1,000 replications of each one of the 9 experiments presented in Fig.5(and described in Table1)

Fig.Fig. 8
Fig. Additional computation time experiments for sample sizes increasing from 100 to 20,000.Panel a reports boxplots (over 100 replications) of the computation times taken by the original (red) and proposed (blue) estimators.Panel b presents the analogous comparison between the proposed (blue) and the alternative (green) estimators

Figure 9
Figure9reports the results from these experiments.Each panel presents boxplots (across the 100 replications) of the absolute value of the estimated bias against the 16 different combinations of sensitivity and specificity thresholds.Overall the results suggest that the original estimator (red boxplots) tends to be more biased than the proposed one (blue boxplots).Observe, as well, that the amount of bias is considerably smaller for the experiments based on larger sample size ( n = 1, 000 ), presented in Fig.9d, e, and f, when compared to the experiments based on smaller samples ( n = 100 ) presented in Fig.9a, b, and c. (Note the different scale in the y-axis for the top and bottom panels in Fig.9).

Table 1
Simulation experiments.n x and n y represent the number of positive and negative cases, respectively the paper) were performed on a Windows machine with processor Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz 2.90 GHz and 64 GB of RAM.The computation time reported for the tpAUC p estimator corre- sponds to the sum of the time taken for: (i) estimating the ROC curve; (ii) estimating the specificity value corresponding to the sensitivity threshold (for checking whether the ROC crosses the area of interest); and (iii) the estimation of the pAUC se , pAUC sp , and AUC quantities.

Table 1 .
The estimates tend to get very close (bottom panels) as sample size increases.Results from each experiment were based on 1,000 replications (see main text for details)

Table 2
Simulation experiments.n x and n y represent the number of positive and negative cases, respectively

Table 3
Real data experiments.CI stands for confidence interval and rf and lr stand for random forest and logistic regression, respectively